AAAI.2021 - Computer Vision

Total: 309

#1 Localization in the Crowd with Topological Constraints

Authors: Shahira Abousamra ; Minh Hoai ; Dimitris Samaras ; Chao Chen

We address the problem of crowd localization, i.e., the prediction of dots corresponding to people in a crowded scene. Due to various challenges, a localization method is prone to spatial semantic errors, i.e., predicting multiple dots within the same person or collapsing multiple dots in a cluttered region. We propose a topological approach targeting these semantic errors. We introduce a topological constraint that teaches the model to reason about the spatial arrangement of dots. To enforce this constraint, we define a persistence loss based on the theory of persistent homology. The loss compares the topographic landscape of the likelihood map with the topology of the ground truth. Topological reasoning improves the quality of the localization algorithm, especially near cluttered regions. On multiple public benchmarks, our method outperforms previous localization methods. Additionally, we demonstrate the potential of our method for improving performance on the crowd counting task.
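
As a rough intuition for the topological constraint (not the paper's persistence loss, which compares persistence diagrams of the likelihood landscape), the sketch below penalizes a mismatch between the number of connected components in the thresholded likelihood map and the number of ground-truth dots; the threshold and the component-counting shortcut are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def component_count_penalty(likelihood, gt_dots, threshold=0.5):
    """Toy topological penalty: compare the number of connected components
    (0-dimensional topology) of the thresholded likelihood map with the
    number of ground-truth dots. The actual persistence loss compares
    persistence diagrams of the likelihood landscape instead."""
    binary = likelihood > threshold              # hypothetical threshold
    _, num_components = ndimage.label(binary)    # connected blobs = predicted "dots"
    return abs(num_components - len(gt_dots))

rng = np.random.default_rng(0)
likelihood = rng.random((64, 64))
gt_dots = [(10, 12), (30, 40), (50, 8)]          # ground-truth head locations
print(component_count_penalty(likelihood, gt_dots))
```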

#2 Deep Event Stereo Leveraged by Event-to-Image Translation

Authors: Soikat Hasan Ahmed ; Hae Woong Jang ; S M Nadim Uddin ; Yong Ju Jung

Depth estimation in real-world applications requires precise responses to fast motion and challenging lighting conditions. Event cameras use bio-inspired event-driven sensors that provide instantaneous and asynchronous information of pixel-level log intensity changes, which makes them suitable for depth estimation in such challenging conditions. However, as event cameras primarily provide asynchronous and spatially sparse event data, it is hard to produce an accurate dense disparity map in stereo event camera setups, especially when estimating disparities on local structures or edges. In this study, we develop a novel deep event stereo network that reconstructs spatial intensity image features from embedded event streams and leverages the event features using the reconstructed image features to compute dense disparity maps. To this end, we propose a novel event-to-image translation network with a cross-semantic attention mechanism that calculates the global semantic context of the event features for the intensity image reconstruction. In addition, a feature aggregation module is developed for accurate disparity estimation, which modulates the event features with the reconstructed image features by a stacked dilated spatially-adaptive denormalization mechanism. Experimental results reveal that our method outperforms the state-of-the-art methods by significant margins in both quantitative and qualitative measures.

#3 Optical Flow Estimation from a Single Motion-blurred Image

Authors: Dawit Mureja Argaw ; Junsik Kim ; Francois Rameau ; Jae Won Cho ; In So Kweon

In most computer vision applications, motion blur is regarded as an undesirable artifact. However, it has been shown that motion blur in an image may be of practical interest for fundamental computer vision problems. In this work, we propose a novel framework to estimate optical flow from a single motion-blurred image in an end-to-end manner. We design our network with transformer networks to learn globally and locally varying motions from encoded features of a motion-blurred input, and decode left and right frame features without explicit frame supervision. A flow estimator network is then used to estimate optical flow from the decoded features in a coarse-to-fine manner. We qualitatively and quantitatively evaluate our model through a large set of experiments on synthetic and real motion-blur datasets. We also provide an in-depth analysis of our model in connection with related approaches to highlight the effectiveness and favorability of our approach. Furthermore, we showcase the applicability of the flow estimated by our method on deblurring and moving object segmentation tasks.

#4 Motion-blurred Video Interpolation and Extrapolation

Authors: Dawit Mureja Argaw ; Junsik Kim ; Francois Rameau ; In So Kweon

Abrupt motion of the camera or objects in a scene results in a blurry video, and therefore recovering a high-quality video requires two types of enhancements: visual enhancement and temporal upsampling. A broad range of research has attempted to recover clean frames from blurred image sequences or temporally upsample frames by interpolation, yet there are very limited studies handling both problems jointly. In this work, we present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner. We design our framework by first learning the pixel-level motion that caused the blur from the given inputs via optical flow estimation, and then predicting multiple clean frames by warping the decoded features with the estimated flows. To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple yet effective flow-based rule. The effectiveness and favorability of our approach are highlighted through extensive qualitative and quantitative evaluations on motion-blurred datasets from high speed videos.

#5 Disentangled Multi-Relational Graph Convolutional Network for Pedestrian Trajectory Prediction

Authors: Inhwan Bae ; Hae-Gon Jeon

Pedestrian trajectory prediction is one of the important tasks required for autonomous navigation and social robots in human environments. Previous studies focused on estimating social forces among individual pedestrians. However, they did not consider the social forces of groups on pedestrians, which results in over-collision avoidance problems. To address this problem, we present a Disentangled Multi-Relational Graph Convolutional Network (DMRGCN) for socially entangled pedestrian trajectory prediction. We first introduce a novel disentangled multi-scale aggregation to better represent social interactions among pedestrians on a weighted graph. For the aggregation, we construct multi-relational weighted graphs based on distances and relative displacements among pedestrians. In the prediction step, we propose a global temporal aggregation to alleviate accumulated errors for pedestrians changing their directions. Finally, we apply DropEdge to our DMRGCN to avoid over-fitting on relatively small pedestrian trajectory datasets. Through the effective incorporation of the three parts within an end-to-end framework, DMRGCN achieves state-of-the-art performance on a variety of challenging trajectory prediction benchmarks.
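
To make the multi-relational graph construction concrete, here is a small sketch under assumed conventions (the scale thresholds, the displacement-similarity relation, and the function names are illustrative, not DMRGCN's exact definitions): one adjacency matrix per distance scale plus one built from relative displacements.

```python
import numpy as np

def build_relation_graphs(positions, displacements, scales=(1.0, 2.0, 4.0)):
    """Sketch of multi-relational graph construction: one binary adjacency per
    distance scale plus one weighted by relative-displacement similarity."""
    n = len(positions)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    graphs = [(dist < s).astype(float) for s in scales]    # distance relations
    disp_sim = displacements @ displacements.T             # relative-motion relation
    graphs.append(np.maximum(disp_sim, 0.0))
    for g in graphs:
        np.fill_diagonal(g, 0.0)                           # no self-loops
    return np.stack(graphs)                                # (num_relations, n, n)

positions = np.random.rand(5, 2) * 10          # 5 pedestrians, (x, y)
displacements = np.random.randn(5, 2) * 0.5    # per-frame displacement vectors
print(build_relation_graphs(positions, displacements).shape)
```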

#6 Dense Events Grounding in Video

Authors: Peijun Bao ; Qian Zheng ; Yadong Mu

This paper explores a novel setting of temporal sentence grounding for the first time, dubbed dense events grounding. Given an untrimmed video and a paragraph description, dense events grounding aims to jointly localize the temporal moments of multiple events described in the paragraph. Our main motivating fact is that multiple events to be grounded in a video are often semantically related and temporally coordinated according to their order of appearance in the paragraph. This fact sheds light on devising a more accurate visual grounding model. In this work, we propose the Dense Events Propagation Network (DepNet) for this novel task. DepNet first adaptively aggregates temporal and semantic information of dense events into a compact set through second-order attention pooling, then selectively propagates the aggregated information to each single event with soft attention. Based on such an aggregation-and-propagation mechanism, DepNet can effectively exploit both the temporal order and the semantic relations of dense events. We conduct comprehensive experiments on the large-scale ActivityNet Captions and TACoS datasets. For fair comparisons, our evaluations include both state-of-the-art single-event grounding methods and their natural extensions to the dense-events grounding setting implemented by us. All experiments clearly show the superiority of the proposed DepNet by significant margins.

#7 Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification

Authors: Ardhendu Behera ; Zachary Wharton ; Pradeep R P G Hewage ; Asish Bera

Deep convolutional neural networks (CNNs) have shown a strong ability in mining discriminative object pose and part information for image recognition. For fine-grained recognition, a context-aware rich feature representation of the object/scene plays a key role, since appearance exhibits significant variance within the same subcategory and only subtle variance among different subcategories. Finding the subtle variance that fully characterizes the object/scene is not straightforward. To address this, we propose a novel context-aware attentional pooling (CAP) that effectively captures subtle changes via sub-pixel gradients, and learns to attend to informative integral regions and their importance in discriminating different subcategories, without requiring bounding-box and/or distinguishable part annotations. We also introduce a novel feature encoding that considers the intrinsic consistency between the informativeness of the integral regions and their spatial structures to capture the semantic correlation among them. Our approach is simple yet extremely effective and can be easily applied on top of a standard classification backbone network. We evaluate our approach using six state-of-the-art (SotA) backbone networks and eight benchmark datasets. Our method significantly outperforms the SotA approaches on six datasets and is very competitive on the remaining two.

#8 Appearance-Motion Memory Consistency Network for Video Anomaly Detection

Authors: Ruichu Cai ; Hao Zhang ; Wen Liu ; Shenghua Gao ; Zhifeng Hao

Abnormal event detection in surveillance video is an essential but challenging task, and many methods have been proposed to deal with this problem. Previous methods either only consider the appearance information or directly integrate the results of appearance and motion information without explicitly considering their endogenous consistency semantics. Inspired by the way humans identify abnormal frames from multi-modality signals, we propose an Appearance-Motion Memory Consistency Network (AMMC-Net). Our method first makes full use of the prior knowledge of appearance and motion signals to explicitly capture the correspondence between them in the high-level feature space. Then, it combines the multi-view features to obtain a more essential and robust feature representation of regular events, which can significantly increase the gap between an abnormal and a regular event. In the anomaly detection phase, we further introduce a commit error in the latent space, combined with the prediction error in pixel space, to enhance the detection accuracy. Solid experimental results on various standard datasets validate the effectiveness of our approach.
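
A minimal sketch of the scoring idea, assuming a memory bank of prototype features for regular events (the fusion weight and normalization are illustrative, not AMMC-Net's exact formulation): the anomaly score mixes the pixel-space prediction error with a latent-space commit error, i.e., the distance of the encoded feature to its nearest memory item.

```python
import numpy as np

def anomaly_score(pred_frame, gt_frame, latent, memory_items, alpha=0.5):
    """Illustrative scoring only: combine the pixel-space prediction error with
    a latent-space commit error (distance to the nearest memory prototype)."""
    pred_err = np.mean((pred_frame - gt_frame) ** 2)          # pixel prediction error
    dists = np.linalg.norm(memory_items - latent, axis=1)     # distance to each memory slot
    commit_err = dists.min()                                  # nearest regular-event prototype
    return alpha * pred_err + (1 - alpha) * commit_err        # assumed equal weighting

rng = np.random.default_rng(0)
score = anomaly_score(rng.random((64, 64)), rng.random((64, 64)),
                      rng.random(128), rng.random((10, 128)))
print(score)
```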

#9 Rethinking Object Detection in Retail Stores

Authors: Yuanqiang Cai ; Longyin Wen ; Libo Zhang ; Dawei Du ; Weiqiang Wang

The conventional standard for object detection uses a bounding box to represent each individual object instance. However, this is not practical in industry-relevant applications in the context of warehouses, due to severe occlusions among groups of instances of the same category. In this paper, we propose a new task, i.e., simultaneous object localization and counting, abbreviated as Locount, which requires algorithms to localize groups of objects of interest together with the number of instances. However, there does not exist a dataset or benchmark designed for such a task. To this end, we collect a large-scale object localization and counting dataset with rich annotations in retail stores, which consists of 50,394 images with more than 1.9 million object instances in 140 categories. Together with this dataset, we provide a new evaluation protocol and divide the training and testing subsets to fairly evaluate the performance of algorithms for Locount, establishing a new benchmark for the Locount task. Moreover, we present a cascaded localization and counting network as a strong baseline, which gradually classifies and regresses the bounding boxes of objects with the predicted numbers of instances enclosed in the bounding boxes, trained in an end-to-end manner. Extensive experiments are conducted on the proposed dataset to demonstrate its significance, and analysis is provided to indicate future directions. The dataset is available at https://isrc.iscas.ac.cn/gitlab/research/locount-dataset.

#10 YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design

Authors: Yuxuan Cai ; Hongjia Li ; Geng Yuan ; Wei Niu ; Yanyu Li ; Xulong Tang ; Bin Ren ; Yanzhi Wang

The rapid development and wide utilization of object detection techniques have aroused attention to both the accuracy and speed of object detectors. However, current state-of-the-art object detection works are either accuracy-oriented, using a large model but leading to high latency, or speed-oriented, using a lightweight model but sacrificing accuracy. In this work, we propose the YOLObile framework, which enables real-time object detection on mobile devices via compression-compilation co-design. A novel block-punched pruning scheme is proposed for any kernel size. To improve computational efficiency on mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced compiler-assisted optimizations. Experimental results indicate that our pruning scheme achieves a 14x compression rate on YOLOv4 with 49.0 mAP. Under our YOLObile framework, we achieve 17 FPS inference speed using the GPU on a Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative scheme, the inference speed increases to 19.1 FPS, a 5x speedup over the original YOLOv4. Source code is at: https://github.com/nightsnack/YOLObile.
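
The following is a simplified, hedged sketch of block-punched pruning on a 2-D weight matrix (block shape, importance measure, and prune ratio are assumptions): the matrix is split into blocks of consecutive rows, and within each block the same column positions are zeroed across all rows.

```python
import numpy as np

def block_punched_prune(weight, block_rows=4, prune_ratio=0.5):
    """Simplified block-punched pruning sketch: inside each block of
    `block_rows` consecutive rows, zero out entire columns (the same position
    across all rows of the block) with the smallest L2 norm."""
    pruned = weight.copy()
    rows, cols = weight.shape
    n_prune = int(cols * prune_ratio)
    for start in range(0, rows, block_rows):
        block = pruned[start:start + block_rows]           # view into pruned
        col_norms = np.linalg.norm(block, axis=0)          # importance per column position
        drop = np.argsort(col_norms)[:n_prune]             # least important positions
        block[:, drop] = 0.0
    return pruned

w = np.random.randn(16, 32)
w_pruned = block_punched_prune(w)
print("sparsity:", float((w_pruned == 0).mean()))
```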

#11 Semantic MapNet: Building Allocentric Semantic Maps and Representations from Egocentric Views

Authors: Vincent Cartillier ; Zhile Ren ; Neha Jain ; Stefan Lee ; Irfan Essa ; Dhruv Batra

We study the task of semantic mapping – specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map (‘what is where?’) from egocentric observations of an RGB-D camera with known pose (via localization sensors). Importantly, our goal is to build neural episodic memories and spatio-semantic representations of 3D spaces that enable the agent to easily learn subsequent tasks in the same space – navigating to objects seen during the tour (‘Find chair’) or answering questions about the space (‘How many chairs did you see in the house?’). Towards this goal, we present Semantic MapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length×width×feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01−16.81% (absolute) on mean-IoU and 3.81−19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the spatio-semantic allocentric representations built by SMNet for the tasks of ObjectNav and Embodied Question Answering. Project page: https://vincentcartillier.github.io/smnet.html.
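
To illustrate the Feature Projector step, here is a minimal sketch (intrinsics, grid resolution, and the last-write scatter are assumptions; SMNet instead learns to accumulate features in the Spatial Memory Tensor): each pixel is back-projected with the depth map and camera intrinsics, transformed by the known pose, and dropped into a top-down grid cell.

```python
import numpy as np

def project_to_floorplan(features, depth, K, cam_to_world, grid_size=100, cell=0.1):
    """Back-project each pixel with depth and intrinsics K, move it to world
    coordinates with the known pose, and scatter its feature into a top-down
    grid by (x, z) location."""
    h, w, c = features.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    world = (cam_to_world @ pts.T).T[:, :3]
    topdown = np.zeros((grid_size, grid_size, c))
    gx = np.clip((world[:, 0] / cell).astype(int), 0, grid_size - 1)
    gz = np.clip((world[:, 2] / cell).astype(int), 0, grid_size - 1)
    topdown[gz, gx] = features.reshape(-1, c)   # last-write scatter; SMNet learns to accumulate
    return topdown

K = np.array([[128.0, 0, 64], [0, 128.0, 64], [0, 0, 1]])   # assumed pinhole intrinsics
feat = np.random.rand(128, 128, 8)
depth = np.random.rand(128, 128) * 5 + 1
print(project_to_floorplan(feat, depth, K, np.eye(4)).shape)
```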

#12 Understanding Deformable Alignment in Video Super-Resolution

Authors: Kelvin C.K. Chan ; Xintao Wang ; Ke Yu ; Chao Dong ; Chen Change Loy

Deformable convolution, originally proposed for the adaptation to geometric variations of objects, has recently shown compelling performance in aligning multiple frames and is increasingly adopted for video super-resolution. Despite its remarkable performance, its underlying mechanism for alignment remains unclear. In this study, we carefully investigate the relation between deformable alignment and the classic flow-based alignment. We show that deformable convolution can be decomposed into a combination of spatial warping and convolution. This decomposition reveals the commonality of deformable alignment and flow-based alignment in formulation, but with a key difference in their offset diversity. We further demonstrate through experiments that the increased diversity in deformable alignment yields better-aligned features, and hence significantly improves the quality of video super-resolution output. Based on our observations, we propose an offset-fidelity loss that guides the offset learning with optical flow. Experiments show that our loss successfully avoids the overflow of offsets and alleviates the instability problem of deformable alignment. Aside from the contributions to deformable alignment, our formulation inspires a more flexible approach to introduce offset diversity to flow-based alignment, improving its performance.
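
A hedged sketch of an offset-fidelity-style loss (the exact weighting and threshold used in the paper may differ): deformable-convolution offsets are penalized only when they deviate from the optical flow by more than a margin, so offset diversity is preserved while overflow is discouraged.

```python
import torch

def offset_fidelity_loss(offsets, flow, threshold=10.0):
    """Penalise deformable-convolution offsets that stray too far from the
    optical flow between the two frames, leaving small deviations free.
    offsets: (B, 2 * num_offsets, H, W), flow: (B, 2, H, W).
    Grouping channels into (x, y) pairs is an assumption of this sketch."""
    b, c, h, w = offsets.shape
    offsets = offsets.view(b, -1, 2, h, w)            # group into (x, y) pairs
    diff = (offsets - flow.unsqueeze(1)).abs()        # deviation from flow per offset
    excess = torch.clamp(diff - threshold, min=0.0)   # only penalise large deviations
    return excess.mean()

offsets = torch.randn(2, 2 * 9, 32, 32) * 5
flow = torch.randn(2, 2, 32, 32)
print(offset_fidelity_loss(offsets, flow))
```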

#13 Deep Metric Learning with Graph Consistency

Authors: Binghui Chen ; Pengyu Li ; Zhaoyi Yan ; Biao Wang ; Lei Zhang

Deep Metric Learning (DML) has become increasingly attractive and widely applied in many computer vision tasks, in which a discriminative embedding is required such that the image features belonging to the same class are gathered together and the ones belonging to different classes are pushed apart. Most existing works attempt to learn this discriminative embedding by either devising powerful pair-based loss functions or hard-sample mining strategies. In this paper, however, we start from another perspective and propose the Deep Consistent Graph Metric Learning (CGML) framework to enhance the discrimination of the learned embedding. It is mainly achieved by rethinking the conventional distance constraints as a graph regularization and then introducing a Graph Consistency regularization term, which intends to optimize the feature distribution from a global graph perspective. Inspired by the characteristics of our defined 'Discriminative Graph', which regards DML from another novel perspective, the Graph Consistency regularization term encourages the sub-graphs randomly sampled from the training set to be consistent. We show that our CGML indeed serves as an efficient technique for learning a discriminative embedding and is applicable to various popular metric objectives, e.g., Triplet, N-Pair and Binomial losses. This paper empirically demonstrates the effectiveness of our graph regularization idea, achieving competitive results on the popular CUB, CARS, Stanford Online Products and In-Shop datasets.
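
As a rough illustration of the graph-consistency idea (the paper's 'Discriminative Graph' construction is more involved than this), the sketch below builds a cosine-similarity graph over each of two sampled sub-batches that share the same class layout and penalizes the discrepancy between the two graphs.

```python
import torch
import torch.nn.functional as F

def graph_consistency_loss(emb_a, emb_b):
    """Toy graph-consistency regularizer: compare cosine-similarity graphs of
    two sub-batches whose i-th rows belong to the same class.
    emb_a, emb_b: (N, D) embeddings of two sampled sub-graphs."""
    g_a = F.normalize(emb_a, dim=1) @ F.normalize(emb_a, dim=1).t()
    g_b = F.normalize(emb_b, dim=1) @ F.normalize(emb_b, dim=1).t()
    return (g_a - g_b).pow(2).mean()       # penalise graph discrepancy

emb_a, emb_b = torch.randn(8, 128), torch.randn(8, 128)
print(graph_consistency_loss(emb_a, emb_b))
```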

#14 CNN Profiler on Polar Coordinate Images for Tropical Cyclone Structure Analysis

Authors: Boyo Chen ; Buo-Fu Chen ; Chun Min Hsiao

Convolutional neural networks (CNNs) have achieved great success in analyzing tropical cyclones (TC) with satellite images in several tasks, such as TC intensity estimation. In contrast, TC structure, which is conventionally described by a few parameters estimated subjectively by meteorology specialists, is still hard to profile objectively and routinely. This study applies CNNs to satellite images to create entire TC structure profiles, covering all the structural parameters. By utilizing meteorological domain knowledge to construct TC wind profiles based on historical structure parameters, we provide valuable labels for training in our newly released benchmark dataset. With such a dataset, we hope to attract more attention to this crucial issue among data scientists. Meanwhile, a baseline is established based on a specialized convolutional model operating on polar coordinates. We discovered that it is more feasible and physically reasonable to extract structural information in polar coordinates, instead of Cartesian coordinates, given a TC's rotational and spiral nature. Experimental results on the released benchmark dataset verify the robustness of the proposed model and demonstrate the potential of applying deep learning techniques to this barely developed yet important topic. For codes and implementation details, please visit https://github.com/BoyoChen/TCSA-CNN-profiler.
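
A minimal Cartesian-to-polar resampling sketch for a single satellite channel (nearest-neighbour sampling and the image centre as the TC centre are assumptions; the released pipeline may differ):

```python
import numpy as np

def to_polar(image, num_radii=64, num_angles=72):
    """Resample a 2-D image onto a (radius, angle) grid around its centre."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radii = np.linspace(0, min(cy, cx), num_radii)
    angles = np.linspace(0, 2 * np.pi, num_angles, endpoint=False)
    r, a = np.meshgrid(radii, angles, indexing="ij")
    ys = np.clip(np.round(cy + r * np.sin(a)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + r * np.cos(a)).astype(int), 0, w - 1)
    return image[ys, xs]                    # shape: (num_radii, num_angles)

ir_image = np.random.rand(201, 201)         # e.g. an infrared satellite channel
print(to_polar(ir_image).shape)
```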

#15 Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling

Authors: Hong Chen ; Yifei Huang ; Hiroya Takamura ; Hideki Nakayama

Visual storytelling is the task of generating relevant and interesting stories for given image sequences. In this work we aim at increasing the diversity of the generated stories while preserving the informative content from the images. We propose to foster the diversity and informativeness of a generated story by using a concept selection module that suggests a set of concept candidates. Then, we utilize a large-scale pre-trained model to convert concepts and images into full stories. To enrich the candidate concepts, a commonsense knowledge graph is created for each image sequence, from which the concept candidates are proposed. To obtain appropriate concepts from the graph, we propose two novel modules that consider the correlation among candidate concepts and the image-concept correlation. Extensive automatic and human evaluation results demonstrate that our model can produce reasonable concepts. This enables our model to outperform the previous models by a large margin on the diversity and informativeness of the story, while retaining the relevance of the story to the image sequence.

#16 Attention-based Multi-Level Fusion Network for Light Field Depth Estimation

Authors: Jiaxin Chen ; Shuo Zhang ; Youfang Lin

Depth estimation from Light Field (LF) images is a crucial basis for LF related applications. Since multiple views with abundant information are available, how to effectively fuse features of these views is a key point for accurate LF depth estimation. In this paper, we propose a novel attention-based multi-level fusion network. Combined with the four-branch structure, we design an intra-branch fusion strategy and an inter-branch fusion strategy to hierarchically fuse effective features from different views. By introducing the attention mechanism, features of views with fewer occlusions and richer textures are selected inside and between these branches to provide more effective information for depth estimation. The depth maps are finally estimated after further aggregation. Experimental results show that the proposed method achieves state-of-the-art performance in both quantitative and qualitative evaluation, and also ranks first on the commonly used HCI 4D Light Field Benchmark.

#17 Joint Demosaicking and Denoising in the Wild: The Case of Training Under Ground Truth Uncertainty

Authors: Jierun Chen ; Song Wen ; S.-H. Gary Chan

Image demosaicking and denoising are the two key fundamental steps in digital camera pipelines, aiming to reconstruct clean color images from noisy luminance readings. In this paper, we propose and study Wild-JDD, a novel learning framework for joint demosaicking and denoising in the wild. In contrast to previous works which generally assume the ground truth of training data is a perfect reflection of the reality, we consider here the more common imperfect case of ground truth uncertainty in the wild. We first illustrate its manifestation as various kinds of artifacts including zipper effect, color moire and residual noise. Then we formulate a two-stage data degradation process to capture such ground truth uncertainty, where a conjugate prior distribution is imposed upon a base distribution. After that, we derive an evidence lower bound (ELBO) loss to train a neural network that approximates the parameters of the conjugate prior distribution conditioned on the degraded input. Finally, to further enhance the performance for out-of-distribution input, we design a simple but effective fine-tuning strategy by taking the input as a weakly informative prior. Taking into account ground truth uncertainty, Wild-JDD enjoys good interpretability during optimization. Extensive experiments validate that it outperforms state-of-the-art schemes on joint demosaicking and denoising tasks on both synthetic and realistic raw datasets.

#18 Spatial-temporal Causal Inference for Partial Image-to-video Adaptation

Authors: Jin Chen ; Xinxiao Wu ; Yao Hu ; Jiebo Luo

Image-to-video adaptation leverages off-the-shelf models learned on labeled images to help classification in unlabeled videos, thus alleviating the high computation overhead of training a video classifier from scratch. This task is very challenging since there exist two types of domain shifts between images and videos: 1) spatial domain shift caused by static appearance variance between images and video frames, and 2) temporal domain shift caused by the absence of dynamic motion in images. Moreover, for different video classes, these two domain shifts have different effects on the domain gap and should not be treated equally during adaptation. In this paper, we propose a spatial-temporal causal inference framework for image-to-video adaptation. We first construct a spatial-temporal causal graph to infer the effects of the spatial and temporal domain shifts by performing counterfactual causality. We then learn causality-guided bidirectional heterogeneous mappings between images and videos to adaptively reduce the two domain shifts. Moreover, to relax the assumption made by existing methods that the label spaces of the image and video domains are the same, we incorporate class-wise alignment into the learning of image-video mappings to perform partial image-to-video adaptation, where the image label space subsumes the video label space. Extensive experiments on several video datasets have validated the effectiveness of our proposed method.

#19 Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

Authors: Long Chen ; Wenbo Ma ; Jun Xiao ; Hanwang Zhang ; Shih-Fu Chang

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., expression-agnostic), hoping that the proposals contain all the right instances mentioned in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores guide the NMS operation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects and resulting in significantly improved grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Codes are available at: https://github.com/ChopinSharp/ref-nms.
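
For illustration only, the sketch below fuses the detector confidence with a hypothetical expression-relatedness score before greedy NMS, so that boxes relevant to the expression survive suppression; the fusion rule is an assumption, not Ref-NMS's exact formulation.

```python
import numpy as np

def expression_aware_nms(boxes, det_scores, rel_scores, iou_thr=0.5, fuse=0.5):
    """Mix detector confidence with an expression-relatedness score, then run
    greedy NMS on the fused scores (fusion rule is an assumption)."""
    scores = (1 - fuse) * det_scores + fuse * rel_scores
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30.]])
print(expression_aware_nms(boxes, np.array([0.9, 0.6, 0.8]), np.array([0.2, 0.9, 0.5])))
```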

#20 RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

Authors: Peihao Chen ; Deng Huang ; Dongliang He ; Xiang Long ; Runhao Zeng ; Shilei Wen ; Mingkui Tan ; Chuang Gan

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to 1) the highly complex spatial-temporal information in videos and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task to effectively model both motion and appearance features. More recently, several attempts have been made to learn video representation through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for the videos. More critically, the learned models may tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with motion patterns and thus provides more effective and stable supervision for representation learning. Therefore, we propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels. In this way, we are able to effectively perceive speed and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task, where we enforce the model to perceive the appearance difference between two video clips. We show that jointly optimizing the two tasks consistently improves the performance on two downstream tasks (namely, action recognition and video retrieval) as the number of pre-training epochs increases. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without the use of labeled data for pre-training, which outperforms the ImageNet supervised pre-trained model. Our code, pre-trained models, and supplementary materials can be found at https://github.com/PeihaoChen/RSPNet.
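
A small sketch of how relative-speed labels could be generated (the sampling scheme and label encoding are illustrative assumptions, not necessarily RSPNet's): two clips are drawn from the same video at random playback speeds, and the pair is labelled by which clip is faster.

```python
import numpy as np

def sample_clip(video, speed, clip_len=16, start=0):
    """Take every `speed`-th frame to simulate playback speed."""
    idx = start + np.arange(clip_len) * speed
    return video[idx % len(video)]

def relative_speed_pair(video, speeds=(1, 2, 4, 8)):
    """Draw two clips at random playback speeds and label the pair by which
    one is faster (0: first slower, 1: same, 2: first faster)."""
    s1, s2 = np.random.choice(speeds, size=2)
    clip1, clip2 = sample_clip(video, s1), sample_clip(video, s2)
    label = int(np.sign(s1 - s2)) + 1
    return clip1, clip2, label

video = np.random.rand(300, 112, 112, 3)    # frames x H x W x C
c1, c2, y = relative_speed_pair(video)
print(c1.shape, c2.shape, y)
```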

#21 Dual Distribution Alignment Network for Generalizable Person Re-Identification

Authors: Peixian Chen ; Pingyang Dai ; Jianzhuang Liu ; Feng Zheng ; Mingliang Xu ; Qi Tian ; Rongrong Ji

Domain generalization (DG) offers a preferable real-world setting for Person Re-Identification (Re-ID), which trains a model using multiple source domain datasets and expects it to perform well in an unseen target domain without any model updating. Unfortunately, most DG approaches are designed explicitly for classification tasks, which fundamentally differ from the retrieval task of Re-ID. Moreover, existing applications of DG in Re-ID cannot correctly handle the massive variation among Re-ID datasets. In this paper, we identify two fundamental challenges in DG for Person Re-ID: domain-wise variations and identity-wise similarities. To this end, we propose an end-to-end Dual Distribution Alignment Network (DDAN) to learn domain-invariant features with dual-level constraints: domain-wise adversarial feature learning and identity-wise similarity enhancement. These constraints effectively reduce the domain shift among multiple source domains while remaining consistent with real-world scenarios. We evaluate our method on a large-scale DG Re-ID benchmark and compare it with various cutting-edge DG approaches. Quantitative results show that DDAN achieves state-of-the-art performance.

#22 RGB-D Salient Object Detection via 3D Convolutional Neural Networks

Authors: Qian Chen ; Ze Liu ; Yi Zhang ; Keren Fu ; Qijun Zhao ; Hongwei Du

RGB-D salient object detection (SOD) has recently attracted increasing research interest, and many deep learning methods based on encoder-decoder architectures have emerged. However, most existing RGB-D SOD models conduct feature fusion either in the single encoder or the decoder stage, which hardly guarantees sufficient cross-modal fusion ability. In this paper, we make the first attempt to address RGB-D SOD with 3D convolutional neural networks. The proposed model, named RD3D, aims at pre-fusion in the encoder stage and in-depth fusion in the decoder stage to effectively promote the full integration of RGB and depth streams. Specifically, RD3D first conducts pre-fusion across RGB and depth modalities through an inflated 3D encoder, and later provides in-depth feature fusion by designing a 3D decoder equipped with rich back-projection paths (RBPP) for leveraging the extensive aggregation ability of 3D convolutions. With such a progressive fusion strategy involving both the encoder and decoder, effective and thorough interaction between the two modalities can be exploited to boost the detection accuracy. Extensive experiments on six widely used benchmark datasets demonstrate that RD3D performs favorably against 14 state-of-the-art RGB-D SOD approaches in terms of four key evaluation metrics. Our code will be made publicly available: https://github.com/PPOLYpubki/RD3D.
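
As one plausible reading of the "inflated 3D encoder" (an assumption on our part, following I3D-style inflation rather than the paper's exact recipe), the sketch below inflates a pretrained 2-D convolution into a 3-D one by repeating its kernel along the new depth axis and rescaling.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d, time_dim=3):
    """Repeat a pretrained 2-D kernel along a new depth/temporal axis and
    rescale so activations are roughly preserved (I3D-style inflation)."""
    w2d = conv2d.weight.data                                   # (out, in, kH, kW)
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       padding=(time_dim // 2, *conv2d.padding),
                       stride=(1, *conv2d.stride), bias=conv2d.bias is not None)
    conv3d.weight.data = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d

conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 4, 32, 32)          # (B, C, depth, H, W)
print(inflate_conv2d(conv2d)(x).shape)
```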

#23 Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval

Authors: Qingchao Chen ; Yang Liu ; Samuel Albanie

When can we expect a text-video retrieval system to work effectively on datasets that differ from its training domain? In this work, we investigate this question through the lens of unsupervised domain adaptation, in which the objective is to match natural language queries and video content in the presence of domain shift at query-time. Such systems have significant practical applications since they are capable of generalising to new data sources without requiring corresponding text annotations. We make the following contributions: (1) We propose the UDAVR (Unsupervised Domain Adaptation for Video Retrieval) benchmark and employ it to study the performance of text-video retrieval in the presence of domain shift. (2) We propose Concept-Aware-Pseudo-Query (CAPQ), a method for learning discriminative and transferable features that bridge these cross-domain discrepancies to enable effective target domain retrieval using source domain supervision. (3) We show that CAPQ outperforms alternative domain adaptation strategies on UDAVR.

#24 Local Relation Learning for Face Forgery Detection

Authors: Shen Chen ; Taiping Yao ; Yang Chen ; Shouhong Ding ; Jilin Li ; Rongrong Ji

With the rapid development of facial manipulation techniques, face forgery has received considerable attention in digital media forensics due to security concerns. Most existing methods formulate face forgery detection as a classification problem and utilize binary labels or manipulated region masks as supervision. However, without considering the correlation between local regions, these global supervisions are insufficient for learning a generalized feature and are prone to overfitting. To address this issue, we propose a novel perspective on face forgery detection via local relation learning. Specifically, we propose a Multi-scale Patch Similarity Module (MPSM), which measures the similarity between features of local regions and forms a robust and generalized similarity pattern. Moreover, we propose an RGB-Frequency Attention Module (RFAM) to fuse information in both the RGB and frequency domains for a more comprehensive local feature representation, which further improves the reliability of the similarity pattern. Extensive experiments show that the proposed method consistently outperforms state-of-the-art methods on widely used benchmarks. Furthermore, detailed visualization shows the robustness and interpretability of our method.
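
A single-scale sketch of a patch-similarity pattern (the MPSM operates at multiple scales and on learned local features; the pooling choice here is an assumption): the feature map is pooled into patches and pairwise cosine similarities form the pattern.

```python
import torch
import torch.nn.functional as F

def patch_similarity_pattern(feature_map, patch_size=4):
    """Average-pool the feature map into patches and compute pairwise cosine
    similarity between patch features.
    feature_map: (B, C, H, W) -> returns (B, P, P) with P = number of patches."""
    pooled = F.avg_pool2d(feature_map, patch_size)              # (B, C, H/p, W/p)
    b, c, h, w = pooled.shape
    patches = pooled.flatten(2).transpose(1, 2)                 # (B, P, C)
    patches = F.normalize(patches, dim=-1)
    return patches @ patches.transpose(1, 2)                    # cosine similarities

fmap = torch.randn(2, 64, 32, 32)
print(patch_similarity_pattern(fmap).shape)   # torch.Size([2, 64, 64])
```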

#25 Deductive Learning for Weakly-Supervised 3D Human Pose Estimation via Uncalibrated Cameras

Authors: Xipeng Chen ; Pengxu Wei ; Liang Lin

Without prohibitive and laborious 3D annotations, weakly-supervised 3D human pose methods mainly employ model regularization based on geometric projection consistency or geometry estimation from multi-view images. Nevertheless, those approaches explicitly require the known parameters of calibrated cameras, exhibiting limited model generalization in various realistic scenarios. To mitigate this issue, in this paper, we propose a Deductive Weakly-Supervised Learning (DWSL) approach for a 3D human pose machine. Our DWSL first learns latent representations of depth and camera pose for 3D pose reconstruction. Since weak supervision usually causes ill-conditioned learning or inferior estimation, our DWSL introduces deductive reasoning to infer the human pose from one view to another and develops a reconstruction loss to demonstrate that what the model learns and infers is reliable. This learning-by-deduction strategy employs the view-transform demonstration and structural rules derived from depth, geometry and angle constraints, which improves the reliability of model training with weak supervision. On three 3D human pose benchmarks, we conduct extensive experiments to evaluate our proposed method, which achieves superior performance in comparison with state-of-the-art weakly-supervised methods. Particularly, our model shows an appealing potential for learning from 2D data captured in dynamic outdoor scenes, demonstrating promising robustness and generalization in realistic scenarios. Our code is publicly available at https://github.com/Xipeng-Chen/DWSL-3D-pose.